Sequence Annotation with HMMs: 
New Problems and Their Complexity 
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Abstract 

Hidden Markov models (HMMs) and their variants were successfully used for several se- 
quence annotation tasks. Traditionally, inference with HMMs is done using the Viterbi and 
posterior decoding algorithms. However, recently a variety of different optimization crite- 
ria and associated computational problems were proposed. In this paper, we consider three 
HMM decoding criteria and prove their NP hardness. These criteria consider the set of states 
used to generate a certain sequence, but abstract from the exact locations of regions emitted 
by individual states. We also illustrate experimentally that these criteria are useful for HIV 
recombination detection. 
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1 Introduction 

Hidden Markov models (HMMs) and their variants were successfully used for several sequence 
annotation problems in bioinformatics, including gene finding, protein secondary structure predic- 
tion, protein family modeling, detection of conserved elements in multiple alignments and others 
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of these areas, we assume that a particular biological sequence X was generated by the HMM, and 
we wish to infer which states of the model were used to generate particular parts of the sequence 
in a process called HMM decoding. The traditional algorithm for this task is the Viterbi algorithm 



(Viterbi, 1967[ ), which finds the state path (sequence of states) generating sequence X with the 
highest probability. 

Many other decoding criteria were proposed (iHamada and Asai 2012 1. For example, we can 



assign labels to states of the HMM, and then search for the most probable sequence of labels 
instead of the most probable state path. If multiple states can share the same label, this problem 



is NP-hard (Lyngs0 and Pedersen 



(Schwartz and Chow 1990 Krogh 



2002 |Brejova et al. 2007) and heuristics are used in practice 



19971). In effect, we use state labels to group together many 



state paths with the same meaning and then search for the group with the highest probability. In 
some application domains, it may be appropriate to group state paths together in different ways. 
In this paper, we explore three optimization problems of this kind. 

Definition 1.1 (The most probable footprint) The footprint of a state path (or labeling) is 
the list of states ( or labels ) visited on the path, discarding the information about the number of 
successive characters emitted by the same state (or label). The probability of a footprint is the sum 
of probabilities of all paths following the footprint. The task is to find the most probable footprint 
for a given HMM and sequence. 
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Definition 1.2 (The most probable set) The set of a state path (or labeling) is the set of 
states ( or labels ) visited on the path, regardless of their order or multiplicity. The probability of 
a set is the sum of probabilities of all paths sharing the same set. The task is to find the set with 
the highest probability for a given HMM and sequence. 

Definition 1.3 (The most probable restriction) A path obeys a restriction (set of states or 
labels) if it uses only states or labels included in the restriction. The probability of a restriction 
is the sum of probabilities of all paths that obey the restriction. The task is to find the restriction 
of size k with the highest probability for a given HMM and sequence. 

These problems were motivated by the HIV recombination detection problem, which we review 
in Section [2j However, their use is not limited to this application and is appropriate wherever 
exact location of individual regions in the sequence is not important. We demonstrate usefulness 
of these problems in practice even if we use heuristics to solve them. Indeed, exact solution is 
unlikely, since in Sections [3j [4] and [5] we show that all three problems are NP-hard. The most 



probable footprint problem was briefly considered by Brown and Truszkowski (2010), who observe 
that it is polynomially solvable in HMMs with two states or two labels. The other two problems 
were not studied previously. 



Hidden Markov models and notation. In the rest of this section, we introduce the necessary 
notation. A hidden Markov model (HMM) is a generative probabilistic model with a finite set of 
states V and transitions E. The generative process starts by choosing a starting state v\ according 
to the initial state probabilities I{v\). Then in each round, the model emits a single symbol Xi from 
the emission probability distribution e(t>i, Xi) of the current state vt, and then changes the state to 
Vi + \ according to the transition probability distribution a(uj, Uj+i). The process continues for some 
fixed number of steps n. Thus, the joint probability of generating a sequence X = x\, . . . , x n by 
a state path 7T = v\, . . . , v n in an HMM H is Pr(7r, X | H, n) = I{v\) ■ e{v\,Xi) ■ n"=2 a ( u i-i; v i) ' 
e(vi,Xi). In other words, the HMM defines a probability distribution Pr(ir,X | H, n) over all 
possible sequences X and state paths tt of length n. 

Let Y be a sequence over some alphabet such that Y = „" where Xi and Xi+i are 

distinct characters and each kj is greater than zero. Then the footprint f{Y) of this sequence is 
X\X2 ■ ■ ■ x n and its character set s(Y) is {x±, x%, . . . x n } (note that the size of this set can be less 
than n). For example for Y = aabaaacc, we have f(Y) — abac and s(Y) = {a, b, c}. 

In particular, we will apply the footprint and set operators to state paths tt. Probability of a 
footprint F for a given HMM H and sequence X of length n is 

Pv(f(n) = F,X\H,n)= £ Pv(tt, X \ H,n). 

Analogously we also define a probability of a given set of states S denoted as Pr(s(7r) = S, X \ 
H, n) . Note that each path tt included in this probability must use every state in S at least once. 
Finally, we will also discuss the probability of a state restriction S denoted as Pr(s(7r) C S,l 
H,n), where we count all state paths that use only states from set S, but are not required to use 
all of them. 

We can also assign label £(v) to each state v of the HMM. The label l(ir) of a state path tt is 
then concatenation of labels for individual states on the path. We can then use similar notation 
for probability of footprints and sets defined on labelings, such as Pr(/(£(7r)) = F, X \ H,n). 

We will say that a state path tt can generate X if Pr(7r, X\H, n) > 0. Similarly a footprint F 
can generate X if Pr(/(7r) = F, X \ H, n) > and a set of states S can generate X if Pr(s(-7r) = 
S, X | H, n) > 0. 



2 Motivation 

The problems studied in this paper were inspired by the HIV recombination detection problem, 



which was recently successfully approached with jumping HMMs (Schultz et al. 2006). In this 
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Figure 1: Prediction of recombination on artificial recombinant of subtypes A and B (black and 
white) with recombination every 950-1050 bases. HERD decoding yielded regions associated with 
incorrect subtypes (gray color representing 3 different subtypes) and fixing either the set or the 
footprint improved accuracy. 



setting, we represent sequence of each subtype of the HIV virus as a profile HMM, and then we 
combine these profiles to a single HMM by addition of special transitions modeling recombination 
between genomes of different strains of the virus. Given a particular genome, we try to establish 
which portions were generated by which profile. However, it is virtually impossible to determine 
the exact position of the recombination. Therefore we may wish to group together state paths 



that differ in positions of individual recombination points only by a small amount (Nanasi et al. 



2010 Brown and Truszkowski 2010 Truszkowski and Brown 2011). 



In this scenario, each subtype corresponds to one label. Set of a labeling s(£(ir)) corresponds 
to the set of subtypes present in the query sequence X. If we are not interested in the location 
of recombination points, this is the most natural measure to optimize. However, we might be 
interested to also know the order of subtypes along the sequence represented by the footprint of a 
labeling J{£{tt)). 

Additionally, we can use a multi-step decoding strategy, where we first fix a set of labels or a 
footprint, and then refine it to a full labeling by a secondary optimization criterion. This approach 
was taken by Truszkowski and Brown (20111, mainly as a heuristic for speeding up the search. 
Here we show that this two-step strategy can be also useful for improving the prediction accuracy. 



In particular, as a second step we use the highest expected reward decoding (HERD) (Nanasi 



et al. 2010). The method has two important parameters: window size W (breakpoints within this 



distance are considered equivalent) and penalty 7 for false positives (each true positive breakpoint 
is scored +1, false positive breakpoint scores —7). HERD optimizes expected value of this scoring 
function under the assumption that the sequence was generated from the HMM. 

As we can see in Figure [2] the program is very sensitive to the choice of 7: for the optimal 
value of 7 it is significantly more accurate than the Viterbi algorithm, but if we increase 7 too 
much, the performance deteriorates. The most common problem is that HERD predicts too many 
breakpoints when 7 is low (Figure [T]). By fixing a footprint as a constraint in the two-step strategy, 
and then optimizing the HERD criterion only for labelings obeying this footprint, the prediction 
accuracy is virtually independent of 7 and relatively close to the optimum values. Fixing the 
set instead of the footprint yields slightly higher specificity and lower sensitivity compared to 
optimizing HERD directly. Note that the footprints and sets are chosen by a simple heuristic; 
perhaps even better results could be obtained with optimal choice of these constraints. 



3 The Most Probable Footprint 

As previously seen, finding the most probable footprint is a reasonable decoding criterion, and 
it may also serve as a starting point in a multi-stage strategy. In this section we show that this 
problem is NP-hard. In particular, we will consider the footprint of a state path f(n). The 
problem of optimizing the footprint of a labeling /(£(%)) is also NP-hard, because optimizing f(ir) 
is its special case, equivalent to optimizing f(£(Tr)) in an HMM in which each state has a unique 
label. 
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Figure 2: Feature specificity (a) and sensitivity (b) as a function of parameter 7 e [0.1,2] on a 
semi- artificial set. A feature is correctly predicted if its boundaries are within 30 symbols of the 
corresponding feature in the correct annotation. Sensitivity is the proportion of real features that 
were correctly predicted and specificity is the proportion of predicted features that are correct. 
We use HERD parameters Pj = 10 -5 and W = 10. F-HERD optimizes the same criterion 
among labelings obeying the footprint obtained by sampling several paths from the probability 
distribution Pr(-7r | X,H,n), computing the footprint for each path, and then taking the most 
frequently occurring footprint among the samples, using the software by [Truszkows kT and Brown| 
(2011). S-HERD optimizes HERD criterion among labelings using only labels from this footprint. 



The data set consists of 150 artificial recombinants of members of various subtypes of HIV virus 
with recombination every 200-300 residues. 



Theorem 3.1 There is a fixed HMM H such that the following problem is NP-complete: Given 
a sequence X of length n and probability p £ [0, 1], determine if there is a footprint F such that 
PT(f(ir) = F,X\H,n)>p. 

Proof We will prove NP-hardness by a reduction from the maximum clique problem using the 
HMM in Figure [3] with eight states and alphabet £ = {S, S', T, T", #, 0, 1, ?}. 

Let G = (V, E) be an undirected graph with n vertices V = {1, 2, . . . , n}. We will encode it 
in a sequence X over alphabet £ as follows. For every vertex v g V, we create a block X v with 
2n + 3 symbols: X v = S"#6 t ,,i#6„ i2 # ■ • • #K,nW where b hJ = 1 if i = j, b hj =? if (i, j) G E and 
bij = otherwise. Sequence X is a concatenation of blocks for all vertices with additional first 
and last symbols: X = SX1X2 ■ ■ ■ X n T. 

All state paths that can generate X have a similar structure. The first symbol S and several 
initial blocks are generated in state S, one block, say A,, is generated in states S", #, 0, 1, and 
T' and the rest of the sequence, including the final symbol T is generated in state T. We will 
say that a state path with this structure covers the block Aj. Note that state E is never used in 
generating A, its role is to ensure that the probability of self-transition is the same in states S 
and T. All state paths that can generate A have the same probability q = Pt(tt, X \ H, |A|) = 

2— 2n 2 — 2n^— n— 2n 2 — n+1 

We say that a state path tt is a run of footprint F, if tt can generate A, and /(71") = F. Every 
footprint F that can generate A has the following structure: F = 5 f S"#ci#C2# . . . #c„#T'T 
where Cj € {0, 1}. The probability of footprint F is qk where k is the number of its runs. Also 
note that every run of F covers a different Aj, because once Xi is known, the whole path is 
uniquely determined. 

We will now prove that the graph G has a clique of size at least k if and only if there is a 
footprint for sequence A with probability at least qk. First, let R be a clique in G of size at least 
k > 0. Consider the footprint F = S'5'#ci#c 2 # . . . #c n #T'T where c, = 1 if i e R and c t = 
otherwise. For any i € R, there is a run 71^ of F that covers Xi. This run will use state 1 for 
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Figure 3: The HMM from the proof of Theorem |3.1| Each circle denotes one state. The HMM 
always starts in state S. Under each state is the set of symbols that the state emits with non-zero 
probability. Each of these symbols is emitted with probability 1/fc, where k is the size of the set. 
Alternatively, all outgoing transitions from a particular state have the same probability. 

generating each bij such that j G R and thus both 6jj G {?, 1} and Cj = 1. For j ^ R we have 
bij — and Cj — 0, thus they will use state in tt. Since there is a different run for every i £ R, 
footprint F has at least k runs. 

Conversely let F be a footprint with probability at least qk > and thus with at least k runs. 
We will construct a clique of size at least k as follows. Let R be the set of all vertices i such that 
/ has a run that covers Xi. Clearly the size of R is at least k. Since F has non-zero probability, it 
has the form SS'#ci#c 2 . . . #c„#T'T for c, G {0, 1}. For all i G R, c,- = 1 because the i-th block 
has bi i = 1. Therefore for all i,j G R, we have bij G {1, ?}, which means that (i,j) G E or i = j. 
This implies that R is indeed a clique. 

To summarize, given graph G and threshold k, we can compute in polynomial time sequence 
X and threshold qk such that G has a clique of size at least k if and only if sequence X has a 
footprint with probability at least qk. This completes our reduction. 

The problem is in NP (even if HMM is not fixed, but given on input), because given an HMM 
H, sequence X and a footprint F, we can compute the probability Pr(/(7r) = F, X \ H, \X\) in 
polynomial time by a dynamic programming algorithm which considers all prefixes of X and all 
prefixes of F. If probability p and parameters of HMMs are given as rational numbers, we can 
compute all quantities without rounding in polynomial number of bits. | 

4 The Most Probable Set of States 

In this section, we prove NP-hardness of finding the most probable set of states. Again, as with 
footprint, this is a special case of the problem of finding the most probable set of labels. 

Theorem 4.1 The following decision problem is NP-hard: Given an HMM H , sequence X of 
length n, and a number p G [0, 1], decide if there exists a set of states S such that Pr (s(tt) = S, X \ H,ri) > 
P- 

To prove this theorem, we will use a reduction from the maximum clique problem. Given a 
graph G = (V,E) and a clique size k, we first choose a suitable threshold k' > k, as detailed below, 
and construct a graph G' = (V',E r ) such that G' has a clique of size k' if and only if G has a 
clique of size k. This is achieved simply by adding k' — k new vertices and connecting each of the 
new vertices to all other vertices in V' . As long as k! — k is not too large, this transformation can 
be done in polynomial time. 

In the next step, we use G' and k' to construct an HMM, an input sequence and a probability 
threshold. We will use the following straightforward way of converting a graph to an HMM. 

Definition 4.1 Let G — (V, E) be an undirected graph (without self-loops). Then the graph HMM 
Hq is defined as follows: 

• Rs set of states is V U {ip}, where tp ^ V is a new state called the error state. 

• Rs emission alphabet is {0, 1}. 
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• Each state v £ V has initial probability I(v) = l/\V\, the error state has initial probability 

• Each state v £ V emits with probability 1, the error state emits 1 with probability 1. 

• Transitions with non-zero probability between states u, v € V correspond to edges in E, more 
precisely: 

a(u, v) = < 11 

I otherwise 

• For u 6 V, we also have a(u, ip) — 1 — Y^veV a ( u ' v ) an( ^ a ("^' u ) = ^* The error state has a 
self-transition with probability 1: a(ip,ip) = !■ 

The error state tp is added to the HMM so that all non-zero transitions between states in 
V have the same probability. Any state path 7r containing only states from V connected by 
transitions with non-zero probability has the same probability of generating sequence X = 0": 
Pv(tt,X = n \Ha,n) = Such paths correspond to walks in graph G. 

Therefore, we will be interested in counting the number of walks in different graphs. Let 
Y(n, G) be the number of walks of length n — 1 in a graph G = (V, E) that visit every vertex 
from G at least once. Note that a walk of length n — 1 contains n — 1 edges and n vertices, and 
therefore Y(n, G) = for n < \V\. As a special case we consider D{n, k) = Y(n, K^), where is 
the complete graph with k vertices. The following claim clearly holds: 

Lemma 4.2 If G is a graph with k vertices and n > k, then Y(n, G) < D(n, k) with equality only 
forG = K k . 

In our reduction we use HMM H = He and X = 0" for a suitable choice of n discussed below. 
As threshold p we will use the value D(n, k')/\V\ n . Clearly, if the input graph G has a clique S of 
size k, graph G' has a clique S' of size k' . There are at least D(n, k') walks of length n — 1 that 
use only vertices in S' and visit each of them at least once. Each of such walks corresponds to one 
state path, and therefore the probability of the set of states S' is exactly p. 

In order to prove the opposite implication, we need suitable choices of n and kl . Table [l] shows 
values of D(n, k) for small values of n and k. For a fixed length of walk n, the number of walks 
in Kk initially grows with increasing k, as we have more choices which vertex to use next, but 
as k approaches n, D(n, k) may start to decrease, because the walks are more constrained by the 
requirement to cover every vertex. We are particularly interested in the value of k where D(n, k) 
achieves the maximum value for a fixed n. In particular we use the following notation: 

M n = min < k; D(n, k) = max D(n, k') > 

I Q<k'<n J 

Note that if there are multiple values of k achieving maximum, we take the smallest one as M n . 
In our reduction, we would like to set n to be the smallest value such that M n = k, but we were 
not able to prove that such n exists for each k. Therefore we choose as n the smallest value such 
that M n > fc, and we denote this value n^. As kl we then use M nk . The following lemma states 
important properties of and M nk . 

Lemma 4.3 The value of is at most \k\nk~\ and nk and M nk can be computed in 0(k°^ 1 >) 
time. 

Before proving this lemma, we finish the proof of the reduction. Let us assume that there is 
a set of states S such that Pr(s(7r) = S,X\H,n) > p. This means that if we consider walks in 
the subgraph G'(S) induced by the set S, we get Y(n,G' (S)) > D(n,k'). We will consider three 
cases: 

• If 5 is a clique and \S\ > k', we have the desired clique in graph G', and therefore there is 
also a clique of size k in graph G. 
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Table 1: Values of D(n,k), rif., M n , and M„ fc for small values of n and fc. Empty cells contain 
zeros. 



• If S is a clique and 15*1 < fc', then by definition of M n we have Y(n,G'(S)) = D(n, \S\) < 
D(n, M n ) — D(n, k'). This is a contradiction with our assumption. 



• If S is not a clique, then by Lemma 4.2 and definition of M n we have Y(n,G' (S)) < 
D(n,K\s\) < D(n,M n ) = D(n,k'). Again we get a contradiction with the inequality 
Y(n,G'(S)) > D(n,k'). 

Therefore we have proved that G contains a clique of size k if and only if the most probable set 
of states in Hqi that can generate X has probability at least p. Moreover, we can construct nk, 
M nkl Hqi, X, and p in polynomial time. 

To complete this proof we need to prove Lemma |4.3| We start by proving another useful 
lemma. 



Lemma 4.4 For 2 < k < n the following recurrence holds: 

D(n, k) = (k- l)D(n -l,k) + kD(n — l,k— 1). 

In addition, D(n, n) — n\, D(n, 1) = for n > 1, and D(n, k) = for k > n. 

Proof Clearly, D(n, n) = nl since walks of length n — 1 correspond to permutations of vertices. 
If n > 1 then D(n, 1) = 0, since K\ does not contain any edges. If k > n, D(n, k) — since a walk 
of length n — 1 can pass through at most n vertices. 

Now let 2 < k < n. Denote as v{w) the number of different vertices covered by walk w. Let w 
be a walk of length n — 1 with v(w) = k and let w' be a walk obtained by taking the first n — 1 
vertices of walk w. Then v(uu') is either k or k — 1. 

Every walk w 1 of length n — 2 with = k can be extended to a walk w of length n — 1 in 

-ff/c in fc — 1 ways, because as the last vertex of w we can use any vertex except the last vertex of 
w' . Therefore there are (fc — l)D{n — 1, fc) different walks w in with property v(w') = k. 

On the other hand if v(w') = k — 1, we can create a walk w" in Kk-i by renumbering the 
vertices in w' so that only numbers {1, . . . , fc — 1} are used (if the vertex missing in w' is i, we 
replace j by j — 1 for every vertex j > i). The same representative w" is shared by fc different walks 
w, because to create w from w" , we need to choose the missing vertex i from all fc possibilities, 
renumber vertices to get w' and then to add the missing vertex i at the end of the walk. Therefore 
there are kD(n — 1, fc — 1) walks with the property v(w') = k — 1. Combining the two cases we 
get the desired recurrence. | 
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Proof of lemma 03] Assume that fc > 3. Clearly, D(n,k) < k(k - l)™" 1 , since k(k - l)™" 1 is 
the number of all walks of length n — 1 in K^. However, this number includes also walks avoiding 
some vertices. The number of such walks can be bounded from above by k(k — l)(fe — 2)™" 1 where 
we choose one of the k vertices to avoid and then consider all possible walks on the remaining k — 1 
vertices. In this way we count some walks multiple times, nonetheless by Bonferroni inequality we 
obtain bound D(n, k) > k(k - l)"" 1 - k(k -l)(k- 2)™" 1 . 

For k > 4 we therefore have that if (fc - l)(fe - < k(k - l)™" 1 - k(k - l){k - 2) n_1 , then 

D(n,k — 1) < D(n,k). By taking logarithm of both sides of the inequality we obtain n > /(fc) 

where /(fc) — 1 + in(k-i)-\n(k-2) • Let n — \f{k)~\ for some k > 4 and consider row n in Table 
[I] We have that D(n,k — 1) < D(n,k) and since function / is increasing, we also we have that 
D(n,k' — 1) < D(n,k') for all fc' < fc (we have proved it only for k' > 4, but it is easy to see 
that it is also true for 2 < k' < 3). The maximum in row n is therefore achieved at some position 
M n > k. Recall, that is the smallest n such that M n > k. Therefore < \f(k)~\ . The function 
fclnfc//(fc) is decreasing and its limit is 1 as k approaches oo. Therefore [/(fc)] < [fclnfc], which 
gives us the inequality < [fclnfc]. This inequality can also be easily verified for fc < 4. Since 
M n < n, we also have M nk < \k In fc] . 

We can compute and M nk by filling in table D(m,j) for all values of m and j up to 



[fclnfc] using the recurrence from lemma 4.4 Since D(n,k) < fc™ < n n , we can store D(m,j) 
in 0(fcpolylog(fc)) bits. Therefore computing the desired values and M nk can be done in 
polynomial time. | 

By using the same reduction as in Theorem [4j we can also prove NP-hardness of the following 
variant of the problem, in which we restrict the size of the set of states S. 

Corollary 4.1 The following problem is NP-hard: Given is an HMM H , sequence X of length 
n, integer k and a number p E [0, 1] and the task to decide if there exists a set of states S of size 
exactly k such that Pr (s(tt) — S, X \ H,n) > p. 

Note that it is not clear if the most probable set of states problem is in NP. In particular, given 
a set of states S, it is NP-hard to find out if its probability is greater than some threshold p, even 
if this threshold is 0, as we show next. 

Theorem 4.5 Given HMM H, sequence X of length n and a subset of state space S, the problem 
of deciding ifPr(s(n) = S,X \ H,n) is non-zero is NP-complete. 



Proof Let G — (V, E) be a graph and Hq be the corresponding graph HMM as in Definition 4.1 
Let X = O' 17 '. Any state path that can generate X and contains all vertices from V contains each 
vertex exactly once. It is easy to see that Pr (s(ir) = V,X\ Hq, \X\) > if and only if G contains 
a Hamiltonian path. | 

Unlike the most probable footprint problem, which was NP-hard even for a fixed HMM of a 
constant size, the most probable set problem is fixed-parameter tractable with respect to the size 
of the HMM. Given an HMM with m states and a sequence of length n, we can find the most 
probable set of states in time 0(2 m m 2 n) by a dynamic programming algorithm similar to the 
Forward algorithm. We define F[i, S, v] to be the sum of probabilities of all states paths tt of 
length i such that s(-7r) = S, 7T ends in state v and generates the first i characters of sequence X. 
To compute F[n, S, v] we use the following equation: 



F[i,S,v] 



' I(v)e(v,X[l}) i=l,S = {v} 

a ( M > V M V > (Fii - !> S\{v}, u] + F[i -l,S,u}) i>l,veS 

otherwise 
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5 The Most Probable State Restriction 



In the most probable set problem, we consider only paths that use each state in the set. In 
some situations it is more natural to allow paths to use only some of these states, as in the most 
probable restriction problem. However, the full set of states of the model is trivially the most 
probable restriction. To get a meaningful problem definition, we restrict the size of the restriction 
to be k. As we will show, this problem is also NP-hard. 

Theorem 5.1 The following problem is NP-complete: Given is an HMM H, sequence X, integer 
k and number p G [0, 1]. Determine if there is a subset of states S of size k such that Pr(s(-7r) C 
S,X\H,\X\)>p. 

Proof We will prove NP-hardness by a reduction from 3-SAT. Consider an instance of 3-SAT 
with the set of variables U = {ui, u 2 , ■ ■ ■ , u n } and the set of clauses C = {ci, C2, . . . , c m }. Based 
on sets U and C, we construct an HMM H as follows. The set of states V will contain all positive 
and negative literals. The emission alphabet £ contains all clauses, all variables and a special 
error symbol tp. The initial probability I(v) of each state is l/(2n), and the transition probability 
a(u, v) between any two states is also l/(2n). State for a literal u emits with probability 1/|S| 
every clause that contains u. State for literal u also generates the positive form of the literal with 
probability Finally, to achieve the sum of emission probabilities to be one, it also generates 

the error symbol with probability 1 — X^gcuc/ e ( w ' x )' 

Based on the SAT instance, we also create string X = u\U2 ■ ■ ■ u n c\C2 . . . c m and set the size of 
the restriction k to equal the number of variables n. Every state path tt that can generate X has 
probability (2n|£|)~l x l , we set threshold p to this value. The first part of sequence X contains 
all variables, and variable it, can be generated only by states Ui and Ui. Therefore one of these 
two states needs to be in the path. Since the first portion of the path already traverses k different 
states, only these states can be used to emit the second part of the sequence. Every clause can be 
emitted only by states for literals that satisfy it. The set of states used by a particular state path 
with non-zero probability therefore corresponds to a satisfying assignment in a straightforward 
way. The HMM has a restriction of size k with probability at least p if and only if the 3-SAT 
instance has a satisfying assignment. 

Note that given a restriction S, we can easily verify if its probability is at least p by a variant 
of the Forward algorithm in which we allow only states in S. Therefore the problem is in NP. | 



6 Conclusion 



In this paper, we have proved NP-hardness of three HMM decoding problems. The most probable 
footprint problem can be viewed as a special case of the most probable ball problem under the 
border shift distance considered by Brown and Truszkowski (20101. In this problem, we sum 



probabilities of all labelings that have the same footprint and differ in positions of all feature 



boundaries by at most d. Brown and Truszkowski (2010) observe that if the HMM is allowed to 



contain multiple states of the same label, the most probable ball problem is NP-hard even for 
d = 0. If d > n, where n is the length of the input sequence, the most probable ball problem 
is equivalent to the most probable footprint problem. Therefore, our results imply NP hardness 
of the most probable ball problem for large values of d even in HMMs in which each state has a 
unique label. However, it is open if the problem is NP hard even for small values of d in such 
HMMs. 

In spite of their hardness, we have demonstrated that the studied problems do have practical 
applications, even if we have to resort to heuristics in order to solve them. From a practical 
point of view, it would be useful to explore better heuristic approaches, or even approximation 
algorithms with provable bounds. It is also of interest to study if polynomial algorithms exist for 
some special classes of HMMs. For example, as pointed out by Brown and Truszkowski (2010), 



the most probable footprint problem is polynomially solvable in HMMs with two states or two 
labels, because a sequence of length n has only 2n possible footprints. 



9 



Acknowledgments. This research was supported by European Community FP7 grants IRG- 
224885 and IRG-231025, grant 1/1085/12 from VEGA and Comenius University grant UK/465/2012. 

References 

Brcjova, B., Brown, D. G., and Vinaf, T. (2007). The most probable annotation problem in HMMs 
and its application to bioinformatics. Journal of Computer and System Sciences, 73(7):1060- 
1077. 

Brown, D. G. and Truszkowski, J. (2010). New decoding algorithms for hidden Markov models 
using distance measures on labellings. BMC Bioinformatics, 11(S1):S40. 

Burge, C. and Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. 
J Mol Biol, 268(1) : 78-94. 

Hamada, M. and Asai, K. (2012). A classification of bioinformatics algorithms from the viewpoint 
of maximizing expected accuracy (MEA). J Comput Biol, 19(5):532-539. 

Krogh, A. (1997). Two methods for improving performance of an HMM and their application for 
gene finding. In Intelligent Systems for Molecular Biology (ISMB 1997), pages 179-186. 

Krogh, A., Larsson, B., von Hcijnc, G., and Sonnhammer, E. L. (2001). Predicting transmembrane 
protein topology with a hidden Markov model: application to complete genomes. Journal of 
Molecular Biology, 305(3):567-570. 

Lyngs0, R. B. and Pederscn, C. N. S. (2002). The consensus string problem and the complexity of 
comparing hidden Markov models. Journal of Computer and System Sciences, 65(3):545-569. 

Nanasi, M., Vinaf, T., and Brejova, B. (2010). The highest expected reward decoding for HMMs 
with application to recombination detection. In Combinatorial Pattern Matching (CPM 2010), 
volume 6129 of Lecture Notes in Computer Science, pages 164-176. Springer. 

Schultz, A.-K., Zhang, M., Leitner, T., Kuikcn, C, Korber, B., Morgenstern, B., and Stanke, M. 
(2006). A jumping profile hidden Markov model and applications to recombination sites in HIV 
and HCV genomes. BMC Bioinformatics, 7:265. 

Schwartz, R. and Chow, Y.-L. (1990). The JV-best algorithms: an efficient and exact procedure 
for finding the N most likely sentence hypotheses. In ICASSP: Acoustics, Speech, and Signal 
Processing, pages 81-84, vol. 1. 

Siepel, A. et al. (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast 
genomes. Genome Res, 15(8):1034-1040. 

Sonnhammer, E. L., Eddy, S. R., and Durbin, R. (1997). Pfam: a comprehensive database of 
protein domain families based on seed alignments. Proteins, 28(3):405-410. 

Truszkowski, J. and Brown, D. G. (2011). More accurate recombination prediction in HIV-1 using 
a robust decoding algorithm for HMMs. BMC Bioinformatics, 12:168. 

Vitcrbi, A. J. (1967). Error bounds for convolutional codes and an asymtotically optimum decoding 
algorithm. IEEE Transactions on Information Theory, IT-13:260-267. 



10 



